Week 12
Causal Inference from Observational Data

SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research


Semester 1, 2026
Last updated: 2026-01-23

Francesco Bailo

Acknowledgement of Country

I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.

Learning Objectives

By the end of this lecture, you will be able to:

  • Understand and create Directed Acyclic Graphs (DAGs)
  • Recognise confounders, mediators, and colliders
  • Apply difference-in-differences estimation
  • Use propensity score matching to balance treatment groups
  • Implement regression discontinuity designs
  • Understand the instrumental variables approach

This Week’s Readings

TSwD Chapter 14

  • 14.2 Directed Acyclic Graphs (DAGs)
  • 14.4 Difference-in-differences
  • 14.5 Propensity score matching
  • 14.6 Regression discontinuity design
  • 14.7 Instrumental variables

ROS Chapters 18-19

  • Ch 18: Causal inference and randomised experiments
  • Ch 19: Causal inference using regression on the treatment variable

The Challenge of Causal Inference

From Experiments to Observational Data

The Fundamental Problem of Causal Inference

We can never observe both potential outcomes for the same unit—what happened AND what would have happened under a different treatment.

In randomised experiments, random assignment ensures treatment and control groups are comparable.

But what if we cannot randomise? We need methods to estimate causal effects from observational data.

Potential Outcomes Framework

For any unit \(i\), we define:

  • \(y_i^0\) = outcome if unit receives control
  • \(y_i^1\) = outcome if unit receives treatment
  • \(\tau_i = y_i^1 - y_i^0\) = individual treatment effect

The problem: We only ever observe ONE of these potential outcomes:

\[y_i = y_i^0(1 - z_i) + y_i^1 z_i\]

where \(z_i\) indicates treatment assignment.
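The switching equation can be illustrated with a few hypothetical units; a minimal base-R sketch (the numbers are made up):

```r
# Hypothetical potential outcomes for three units
y0 <- c(4, 6, 5)  # outcome under control
y1 <- c(7, 6, 9)  # outcome under treatment
z  <- c(1, 0, 1)  # treatment assignment

# The switching equation: we observe y1 where z = 1, y0 where z = 0
y <- y0 * (1 - z) + y1 * z
y
# [1] 7 6 9

# The individual effects tau_i = y1 - y0 are never jointly observed
```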

Average Treatment Effects

We typically estimate average causal effects:

| Estimand | Definition |
|---|---|
| SATE (Sample Average Treatment Effect) | \(\frac{1}{n}\sum_{i=1}^{n}(y_i^1 - y_i^0)\) |
| PATE (Population Average Treatment Effect) | \(\frac{1}{N}\sum_{i=1}^{N}(y_i^1 - y_i^0)\) |
| CATE (Conditional Average Treatment Effect) | Average effect within a subgroup |

Why Simple Comparisons Fail

Self-Selection Bias

When treatment groups differ systematically in ways that also affect the outcome, simple comparisons are misleading.

Example: Comparing outcomes of people who chose to take supplements vs. those who didn’t.

  • Those who take supplements may be more health-conscious
  • They may have higher incomes, better diets, more exercise
  • Any observed difference conflates treatment effect with pre-existing differences

Directed Acyclic Graphs (DAGs)

What Are DAGs?

Directed Acyclic Graphs are visual representations of causal relationships:

  • Nodes represent variables
  • Arrows represent causal relationships
  • Acyclic means no feedback loops
digraph D {
  rankdir=LR;
  node [shape=plaintext, fontname="helvetica"];
  X -> Y;
}


This DAG says: “X causes Y”

Building DAGs in R

library(ggdag)

simple_dag <- dagify(
  Y ~ X,
  coords = list(
    x = c(X = 0, Y = 1),
    y = c(X = 0, Y = 0)
  )
)

ggdag(simple_dag) +
  theme_dag()

Confounders

A confounder is a variable that:

  1. Causes the treatment variable
  2. Causes the outcome variable
digraph D {
  node [shape=plaintext, fontname="helvetica"];
  Education [label="Education"];
  Income [label="Income"];
  Happiness [label="Happiness"];
  
  {rank=same Income Happiness};
  
  Education -> Income;
  Education -> Happiness;
  Income -> Happiness;
}


Confounders must be controlled for!

Failing to adjust for confounders creates a “backdoor path” that biases our causal estimate.

Confounder Example in R

confounder_dag <- dagify(
  Happiness ~ Income + Education,
  Income ~ Education,
  exposure = "Income",
  outcome = "Happiness"
)

ggdag_status(confounder_dag) +
  theme_dag() +
  guides(colour = "none")
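The adjustment logic can also be checked algorithmically. A minimal sketch, assuming the dagitty package is installed, asking which variables close the backdoor path from Income to Happiness:

```r
library(dagitty)

g <- dagitty("dag {
  Education -> Income
  Education -> Happiness
  Income -> Happiness
}")

# Minimal sufficient adjustment set(s) for the Income -> Happiness effect
adjustmentSets(g, exposure = "Income", outcome = "Happiness")
# { Education }
```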

Mediators

A mediator is a variable that lies on the causal pathway between treatment and outcome:

digraph D {
  node [shape=plaintext, fontname="helvetica"];
  Income [label="Income"];
  Children [label="Children"];
  Happiness [label="Happiness"];
  
  {rank=same Income Happiness};
  
  Income -> Happiness;
  Income -> Children;
  Children -> Happiness;
}


Do NOT control for mediators!

Controlling for a mediator blocks part of the causal effect you’re trying to estimate.

Colliders

A collider is a variable caused by both the treatment and the outcome:

digraph D {
  node [shape=plaintext, fontname="helvetica"];
  Income [label="Income"];
  Exercise [label="Exercise"];
  Happiness [label="Happiness"];
  
  {rank=same Income Happiness};
  
  Income -> Happiness;
  Income -> Exercise;
  Happiness -> Exercise;
}


Do NOT control for colliders!

Controlling for a collider opens a spurious path and creates bias where none existed.
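Collider bias is easy to demonstrate by simulation; a minimal sketch (not from the readings) in which income and happiness are independent by construction, yet both cause exercise:

```r
set.seed(853)
n <- 10000
income    <- rnorm(n)
happiness <- rnorm(n)                       # independent of income
exercise  <- income + happiness + rnorm(n)  # the collider

# Unadjusted regression correctly finds an effect near zero
b_unadjusted <- coef(lm(happiness ~ income))["income"]

# Adjusting for the collider opens a spurious path (coefficient near -0.5)
b_collider <- coef(lm(happiness ~ income + exercise))["income"]

round(c(b_unadjusted, b_collider), 2)
```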

DAG Summary: What to Control For?

| Variable Type | Relationship | Control for it? |
|---|---|---|
| Confounder | Causes both treatment and outcome | ✓ Yes |
| Mediator | On the causal path from treatment to outcome | ✗ No |
| Collider | Caused by both treatment and outcome | ✗ No |

Key Insight

More controls are not always better! You must think carefully about the causal structure.

Difference-in-Differences

The Idea Behind DiD

Difference-in-differences compares:

  1. Changes over time in a treatment group
  2. Changes over time in a control group

The treatment effect = (difference in treatment group) − (difference in control group)

\[\hat{\tau}_{DiD} = (\bar{Y}_{T,after} - \bar{Y}_{T,before}) - (\bar{Y}_{C,after} - \bar{Y}_{C,before})\]

Visual Intuition for DiD

DiD in R: Simulated Example

set.seed(853)
n <- 1000

# Simulate DiD data
did_sim <- tibble(
  person = rep(1:n, 2),
  time = rep(c(0, 1), each = n),
  treated = rep(sample(0:1, n, replace = TRUE), 2)
) |>
  mutate(
    # Outcome depends on time, treatment group, and their interaction
    outcome = 5 + 
      2 * time +           # Time trend
      3 * treated +        # Group difference
      4 * time * treated + # Treatment effect!
      rnorm(n * 2)
  )

DiD Regression Model

The DiD model is a regression with an interaction term:

\[Y_{it} = \beta_0 + \beta_1 \cdot \text{Time}_t + \beta_2 \cdot \text{Treatment}_i + \beta_3 \cdot (\text{Time} \times \text{Treatment})_{it} + \epsilon_{it}\]

  • \(\beta_1\): Time trend for control group
  • \(\beta_2\): Baseline difference between groups
  • \(\beta_3\): The treatment effect (difference-in-differences)

DiD Results

did_model <- lm(outcome ~ time * treated, data = did_sim)
modelsummary(did_model, 
             coef_rename = c("time" = "Time", 
                            "treated" = "Treatment Group",
                            "time:treated" = "DiD Effect (Time × Treatment)"),
             gof_omit = "IC|Log|F|RMSE")
| | (1) |
|---|---|
| (Intercept) | 5.041 (0.043) |
| Time | 2.015 (0.061) |
| Treatment Group | 2.986 (0.062) |
| DiD Effect (Time × Treatment) | 3.897 (0.088) |
| Num.Obs. | 2000 |
| R2 | 0.917 |
| R2 Adj. | 0.917 |

DiD: Key Assumptions and Threats

Threats to validity:

  1. Non-parallel trends: Groups were already diverging
  2. Compositional changes: Who’s in each group changes over time
  3. Spillover effects: Treatment affects control group
  4. Anticipation effects: Behaviour changes before treatment

Best Practice

Always visualise pre-treatment trends and discuss why the parallel trends assumption is plausible in your setting.

Propensity Score Matching

The Matching Idea

Goal: Create treatment and control groups that are similar on observed characteristics.

Problem: With many covariates, exact matching is impossible.

Solution: Match on a single number—the propensity score.

\[e(X) = P(\text{Treatment} = 1 | X)\]

The probability of receiving treatment given observed covariates.

How Propensity Score Matching Works

  1. Estimate the propensity score (usually with logistic regression)
  2. Match treated units to control units with similar propensity scores
  3. Check balance on covariates between matched groups
  4. Estimate the treatment effect using the matched sample

PSM Example: Simulating Data

set.seed(853)
n <- 1000

psm_data <- tibble(
  id = 1:n,
  age = sample(18:65, n, replace = TRUE),
  income = rnorm(n, 50000, 15000)
) |>
  mutate(
    # Treatment probability depends on age and income
    prop_score = plogis(-2 + 0.03 * age + 0.00003 * income),
    treatment = rbinom(n, 1, prop_score),
    # Outcome depends on treatment AND confounders
    outcome = 10 + 5 * treatment + 0.1 * age + 0.0001 * income + rnorm(n, 0, 5)
  )

Naive Estimate vs. True Effect

# Naive comparison (biased)
naive_effect <- mean(psm_data$outcome[psm_data$treatment == 1]) - 
                mean(psm_data$outcome[psm_data$treatment == 0])

cat("Naive estimate:", round(naive_effect, 2), "\n")
Naive estimate: 5.19 
cat("True effect: 5")
True effect: 5

Warning

The naive estimate is biased because treated and control groups differ on age and income!

Using MatchIt in R

library(MatchIt)

# Perform propensity score matching
matched <- matchit(
  treatment ~ age + income,
  data = psm_data,
  method = "nearest",     # Nearest neighbour matching
  distance = "glm"        # Logistic regression for propensity score
)

matched
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 1000 (original), 686 (matched)
 - target estimand: ATT
 - covariates: age, income

Checking Balance

# Visual balance check
plot(matched, type = "jitter", 
     interactive = FALSE)

Estimating the Treatment Effect

# Get matched data
matched_data <- match.data(matched)

# Estimate treatment effect on matched sample
psm_model <- lm(outcome ~ treatment + age + income, data = matched_data)

modelsummary(psm_model, 
             coef_rename = c("treatment" = "Treatment Effect"),
             gof_omit = "IC|Log|F|RMSE")
| | (1) |
|---|---|
| (Intercept) | 9.843 (1.017) |
| Treatment Effect | 4.028 (0.480) |
| age | 0.094 (0.016) |
| income | 0.000 (0.000) |
| Num.Obs. | 686 |
| R2 | 0.364 |
| R2 Adj. | 0.361 |

PSM Limitations

Key Limitations

  1. Unobserved confounders: Can only match on what we observe
  2. Model dependence: Results depend on how propensity score is estimated
  3. Common support: Need overlap in propensity scores between groups

“Propensity score matching cannot match on unobserved variables… it is difficult to understand why individuals that appear to be so similar would have received different treatments, unless there is something unobserved.” — TSwD

Regression Discontinuity Design

The RDD Idea

Regression Discontinuity Design exploits situations where treatment is assigned based on a cutoff in a continuous variable (the “running variable” or “forcing variable”).

Examples:

  • Students scoring ≥80% get an A
  • Age 21 allows legal drinking
  • Income below threshold qualifies for welfare

Key insight: People just above and just below the cutoff are essentially identical, except for treatment!

Sharp RDD: Visual Intuition

RDD Model

The basic RDD model:

\[Y_i = \alpha + \tau \cdot \text{Treatment}_i + \beta \cdot \text{RunningVar}_i + \epsilon_i\]

Or, centring the running variable at the cutoff \(c\):

\[Y_i = \alpha + \tau \cdot D_i + \beta \cdot (X_i - c) + \epsilon_i\]

where \(D_i = 1\) if \(X_i \geq c\).

The coefficient \(\tau\) is the treatment effect at the cutoff.

RDD in R: Simulated Example

set.seed(853)
n <- 1000

rdd_sim <- tibble(
  mark = runif(n, 70, 90),
  got_scholarship = ifelse(mark >= 80, 1, 0),
  # True effect of scholarship = 8 points on future performance
  future_score = 40 + 0.5 * mark + 8 * got_scholarship + rnorm(n, 0, 3)
)

Estimating the RDD Effect

# Centre the running variable
rdd_sim <- rdd_sim |>
  mutate(mark_centred = mark - 80)

# RDD regression
rdd_model <- lm(future_score ~ got_scholarship + mark_centred, 
                data = rdd_sim)

modelsummary(rdd_model,
             coef_rename = c("got_scholarship" = "Scholarship Effect",
                            "mark_centred" = "Mark (centred)"),
             gof_omit = "IC|Log|F|RMSE")
| | (1) |
|---|---|
| (Intercept) | 80.122 (0.207) |
| Scholarship Effect | 7.961 (0.368) |
| Mark (centred) | 0.487 (0.032) |
| Num.Obs. | 1000 |
| R2 | 0.833 |
| R2 Adj. | 0.832 |

RDD Assumptions

Key Assumptions

  1. No manipulation: Units cannot precisely manipulate their running variable to cross the threshold
  2. Continuity: The relationship between running variable and outcome is continuous (except for the treatment jump)

Check for manipulation: Look for bunching just above or below the cutoff.
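A quick informal version of that check, sketched in base R on the simulated marks (a formal density test is available in the rddensity package):

```r
set.seed(853)
mark <- runif(1000, 70, 90)  # the running variable from the simulation

# Compare counts in one-mark-wide bins on either side of the cutoff
just_below <- sum(mark >= 79 & mark < 80)
just_above <- sum(mark >= 80 & mark < 81)
c(below = just_below, above = just_above)

# With no manipulation the two counts should be similar; a pile-up
# just above 80 would suggest students gaming the threshold
```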

RDD: Sharp vs. Fuzzy

| Type | Description |
|---|---|
| Sharp RDD | Treatment is perfectly determined by the cutoff (everyone above the cutoff is treated) |
| Fuzzy RDD | The cutoff changes the probability of treatment but doesn't guarantee it |

Fuzzy RDD requires instrumental variables methods (using the cutoff as an instrument).

Instrumental Variables

The IV Challenge

Sometimes we have:

  • Confounders we cannot observe or measure
  • Treatment that is endogenous (correlated with the error term)

Solution: Find an instrumental variable that:

  1. Is correlated with the treatment (relevance)
  2. Affects the outcome ONLY through the treatment (exclusion restriction)

IV Intuition

digraph D {
  node [shape=plaintext, fontname="helvetica"];
  
  Instrument -> Treatment;
  Treatment -> Outcome;
  Confounder -> Treatment;
  Confounder -> Outcome;
  
  {rank=same Treatment Outcome};
}


The instrument provides variation in treatment that is unrelated to the confounder.

Classic IV Example: Cigarette Taxes

Question: Does smoking cause lung cancer?

Problem: Smokers differ from non-smokers in many ways (confounders).

Solution: Use cigarette taxes as an instrument.

  • Taxes affect smoking rates (relevance ✓)
  • Taxes don’t directly cause cancer (exclusion restriction ✓)

IV Estimation: Two-Stage Least Squares

Stage 1: Regress treatment on instrument

\[\text{Smoking}_i = \alpha_1 + \gamma \cdot \text{Tax}_i + \epsilon_{1i}\]

Stage 2: Regress outcome on predicted treatment

\[\text{Cancer}_i = \alpha_2 + \beta \cdot \widehat{\text{Smoking}}_i + \epsilon_{2i}\]

The coefficient \(\beta\) is the causal effect.

IV in R: Simulated Example

set.seed(853)
n <- 2000

iv_sim <- tibble(
  # Instrument: tax rate (varies by province)
  tax_rate = sample(c(0.3, 0.4, 0.5), n, replace = TRUE),
  # Unobserved confounder
  health_conscious = rnorm(n),
  # Treatment: affected by tax and confounder
  smoking = 10 - 5 * tax_rate - 2 * health_conscious + rnorm(n),
  # Outcome: affected by smoking and confounder
  # True causal effect of smoking = -3
  health = 80 - 3 * smoking + 5 * health_conscious + rnorm(n, 0, 5)
)

Naive vs. IV Estimates

# Naive OLS (biased due to confounding)
naive_model <- lm(health ~ smoking, data = iv_sim)

# IV estimation using ivreg (or manually)
library(estimatr)
iv_model <- iv_robust(health ~ smoking | tax_rate, data = iv_sim)

cat("Naive estimate:", round(coef(naive_model)["smoking"], 2), "\n")
Naive estimate: -4.91 
cat("IV estimate:", round(coef(iv_model)["smoking"], 2), "\n")
IV estimate: -2.43 
cat("True effect: -3")
True effect: -3
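The two stages can also be run by hand with two lm() calls; a sketch that regenerates the simulated variables as base-R vectors (same seed and draw order as iv_sim, so the estimate should be close to the iv_robust() result). Note that the standard errors from a manual second stage are not valid, which is one reason to prefer iv_robust() or ivreg():

```r
set.seed(853)
n <- 2000
tax_rate <- sample(c(0.3, 0.4, 0.5), n, replace = TRUE)
health_conscious <- rnorm(n)  # unobserved confounder
smoking <- 10 - 5 * tax_rate - 2 * health_conscious + rnorm(n)
health  <- 80 - 3 * smoking + 5 * health_conscious + rnorm(n, 0, 5)

# Stage 1: regress treatment on the instrument
stage1 <- lm(smoking ~ tax_rate)

# Relevance check: the first-stage F-statistic should comfortably exceed 10
summary(stage1)$fstatistic[1]

# Stage 2: regress the outcome on the *predicted* treatment
stage2 <- lm(health ~ fitted(stage1))
iv_manual <- coef(stage2)[2]
iv_manual  # close to the true effect of -3
```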

IV Assumptions

Two Critical Assumptions

  1. Relevance: The instrument must affect the treatment
    • Can be tested statistically (first-stage F-statistic should exceed 10)
  2. Exclusion Restriction: The instrument affects outcome ONLY through treatment
    • Cannot be tested—must be argued theoretically

Finding valid instruments is hard. Many purported instruments fail the exclusion restriction.

Summary and Comparison

Methods Overview

| Method | Key Assumption | Data Requirement |
|---|---|---|
| DiD | Parallel trends | Before/after data for treatment and control groups |
| PSM | Selection on observables | Rich set of pre-treatment covariates |
| RDD | No manipulation; continuity | Running variable with a cutoff |
| IV | Exclusion restriction | A valid instrument |

Decision Framework

digraph D {
  node [shape=box, fontname="helvetica"];
  
  Q1 [label="Is there a\nnatural cutoff?"];
  Q2 [label="Is there a\nvalid instrument?"];
  Q3 [label="Is there before/after\ndata with control group?"];
  Q4 [label="Can you measure\nall confounders?"];
  
  RDD [label="RDD", shape=ellipse];
  IV [label="IV", shape=ellipse];
  DiD [label="DiD", shape=ellipse];
  PSM [label="PSM", shape=ellipse];
  Caution [label="Be very\ncautious", shape=ellipse];
  
  Q1 -> RDD [label="Yes"];
  Q1 -> Q2 [label="No"];
  Q2 -> IV [label="Yes"];
  Q2 -> Q3 [label="No"];
  Q3 -> DiD [label="Yes"];
  Q3 -> Q4 [label="No"];
  Q4 -> PSM [label="Yes"];
  Q4 -> Caution [label="No"];
}


R Packages for Causal Inference

# DAGs
library(ggdag)       # Visualise and analyse DAGs
library(dagitty)     # DAG algebra

# Difference-in-Differences
library(did)         # Callaway and Sant'Anna estimator
library(fixest)      # Fast fixed effects estimation

# Propensity Score Matching
library(MatchIt)     # Matching methods
library(cobalt)      # Balance assessment

# Regression Discontinuity
library(rdrobust)    # Robust RDD estimation
library(rddensity)   # Manipulation testing

# Instrumental Variables
library(estimatr)    # iv_robust() function
library(ivreg)       # 2SLS estimation

Key Takeaways

  1. Causal inference from observational data requires strong assumptions
    • No method is a “magic bullet”
    • Transparency about assumptions is crucial
  2. DAGs help clarify causal thinking
    • Control for confounders, not mediators or colliders
  3. Choose methods based on your setting
    • DiD: Parallel trends plausible
    • PSM: Selection on observables
    • RDD: Natural cutoff exists
    • IV: Valid instrument available

Next Week

Week 13: Advanced Applications and Best Practices

  • Data sharing and documentation
  • Cross-validation and model validation
  • Multilevel regression with post-stratification (MRP)
  • Course synthesis and best practices
